Abstract: System performance and reliability are jointly
assessed for highly reliable communication/computer networks.
The model assumes that at most a small number of components
can be down at a time and that the average repair/replacement
time of a failed component is small compared with the
average failure times of network components. System performance
is measured in terms of network throughput in steady-state
operation.

1. INTRODUCTION 

Perhaps the most important factors in judging the quality
of a communication or computer network are its performance
and reliability. If two systems have the same performance
characteristics, one anticipates that in the long run
the more reliable system will perform better than the less
reliable one. However, when comparing two (or more) systems
with both different performance characteristics and different
reliabilities (availabilities), it is not always clear
which system will be superior overall.

This paper concentrates on jointly evaluating performance
and reliability for a communication/computer network
in which 1) the performance criterion is system
throughput, and 2) the network components are
themselves highly reliable. The approach to this problem
follows the methodology recently developed in [2]. A brief
overview, a history of network reliability, and a short
bibliography are also in [2]. It is assumed that the traffic
flow in the network can be modeled by a network of
Markov queues and that there is only one type of traffic
(for instance, data packets).

The motivation and system description are given
in section 2. Section 3 describes the methodology for
calculating the average throughput and the availability of a
network with highly reliable components. Section 4 briefly
discusses how the steady-state network throughput is obtained.
Section 5 contains a few examples and section 6 gives
the concluding remarks.



2. SYSTEM DESCRIPTION & ASSUMPTIONS 

In communication/computer networks, the failures of
some components (nodes or links) can result in degraded
performance of the network. The network fails if it reaches
a state where its performance is not acceptable. Network
failures usually have two failure modes:

• Connective failure, where failures of some components
result in some disconnected nodes.

• Congestive failure, where the failures of some components
result in overload and congestion in remaining
components, blocking of the incoming traffic, buffer
overflow, etc.

Assumptions 

• The network components are highly reliable; component
MTTFs are large (order of magnitude: years or
months, say). The failure times are exponentially
distributed.

• The repair times are much shorter than MTTF
(order of magnitude: days or hours).

• The call/packet/message interarrival times and processing
times are much shorter than MTTR (order of
magnitude: seconds or microseconds).

• The network is in a steady state; when a failure occurs,
the traffic flow in the remaining working components
reaches steady state quickly (order of magnitude of this
time to reach steady state: seconds or minutes).

• The repaired components are as good as new. 

• Components have 2 states: up and down. 

The assumptions are not very restrictive, fit a large number 
of real networks, and render the problem tractable. 



3. THE MODEL 



Notation 


N              number of components in the network
I              number of nodes in the network
$\lambda_i$    failure rate of failure type i, i = 1, ..., M
M              number of distinct failure types
$D_i$          downtime due to failure type i
$p_i$          steady-state probabilities, i = 0, ..., M
$r(k)$         throughput of the network while the network state
               is k; k = 0, ..., M
AT             average network throughput
AV             network availability

Consider a network having N components (nodes + links).
The components are highly reliable. From time to time,
however, they fail. Sometimes even a group of components
fails simultaneously. When a component (or group of
components) fails, a repair/replacement procedure is
initiated. After repair, the failed components are back to
full function. The downtimes are relatively short compared
with the failure times (uptimes). Therefore, for simplicity,
we assume that the probability of another component failure
during the downtime is zero. More precisely, we assume that
a component failure (or simultaneous failure of a group of
components) inhibits any further failures until the
completion of the repair. There are M distinct failure types
(states), each state having a positive probability of
occurrence.

Under these assumptions, the behavior of the network
can be modeled by the following stochastic process X(t):
Initially, the process X(t) spends an exponential amount of
time (with mean $1/\Lambda$) at state 0, the state in which all
the network components are up. When a failure occurs,
with probability $\lambda_i/\Lambda$ it is of type i, i = 1, ..., M, where

    $\Lambda = \sum_{i=1}^{M} \lambda_i$,   $\lambda_i > 0$.

The downtime due to failure type i, $D_i$, has a general
distribution. After completion of the repair, the process
always returns to state 0, so the process oscillates between
the up state 0 and some down state i (see figure 1).
Equivalently: when the system returns to state 0, M
independent Poisson processes start simultaneously.
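The process described above can be simulated directly. The following is a minimal sketch, not part of the original model description: the function name `simulate_up_fraction` is hypothetical, the downtimes are drawn from an exponential distribution purely for illustration (the model allows any distribution with the given mean), and the numerical values are borrowed from the example in Section 5. The estimate is compared with the regenerative result $1/(1 + \sum_i \lambda_i E\{D_i\})$ for the fraction of time spent in state 0.

```python
import random

def simulate_up_fraction(failure_rates, mean_downtimes, cycles=100_000, seed=1):
    """Estimate the long-run fraction of time X(t) spends in state 0."""
    rng = random.Random(seed)
    total_rate = sum(failure_rates)              # Lambda = sum of lambda_i
    types = list(range(len(failure_rates)))
    up_time = 0.0
    down_time = 0.0
    for _ in range(cycles):
        # Sojourn in state 0: exponential with mean 1/Lambda.
        up_time += rng.expovariate(total_rate)
        # The failure is of type i with probability lambda_i / Lambda.
        i = rng.choices(types, weights=failure_rates)[0]
        # ASSUMPTION: exponential downtime with mean E{D_i}, for illustration only.
        down_time += rng.expovariate(1.0 / mean_downtimes[i])
    return up_time / (up_time + down_time)

rates = [1.0] * 6                  # failures per year (Section 5 values)
downtimes = [2.0 / 365.0] * 6      # mean downtime in years (Section 5 values)
estimate = simulate_up_fraction(rates, downtimes)
exact = 1.0 / (1.0 + sum(lam * d for lam, d in zip(rates, downtimes)))
print(round(estimate, 4), round(exact, 4))
```

Because no failure can occur during a repair, each cycle is simply one uptime followed by one downtime, which is what keeps the simulation loop this short.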
    $p_j = \rho_j \, p_0$,   j = 1, ..., M                      (3.2)

    $\rho_j = \lambda_j E\{D_j\}$,   j = 1, ..., M              (3.3)



Moreover, using theorem 3.6.1 [3, p 78], the average network
throughput is:

    $AT = \frac{E\{\text{throughput during one cycle}\}}{E\{\text{cycle time}\}}
        = \frac{r(0) + \sum_{k=1}^{M} r(k)\,\rho_k}{1 + \sum_{k=1}^{M} \rho_k}
        = \sum_{k=0}^{M} r(k)\, p_k$.                            (3.4)




Cycle time is the time between two successive regeneration 
points. 
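The two expectations in (3.4) can be written out explicitly. The following is a sketch of that intermediate step, assuming $\rho_k = \lambda_k E\{D_k\}$ and that throughput accumulates at rate $r(k)$ while the process is in state k:

```latex
\begin{align*}
E\{\text{cycle time}\}
  &= \frac{1}{\Lambda} + \sum_{k=1}^{M} \frac{\lambda_k}{\Lambda}\, E\{D_k\}
   = \frac{1}{\Lambda}\Bigl(1 + \sum_{k=1}^{M} \rho_k\Bigr), \\
E\{\text{throughput during one cycle}\}
  &= \frac{r(0)}{\Lambda} + \sum_{k=1}^{M} \frac{\lambda_k}{\Lambda}\, E\{D_k\}\, r(k)
   = \frac{1}{\Lambda}\Bigl(r(0) + \sum_{k=1}^{M} r(k)\,\rho_k\Bigr).
\end{align*}
```

The factor $1/\Lambda$ cancels in the ratio, and substituting $p_0 = 1/(1 + \sum_k \rho_k)$ and $p_k = \rho_k p_0$ gives the final sum over k = 0, ..., M.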

$r(k)$, the throughput at state k, is defined as the
equilibrium throughput of the network while the network
state is k.

Another measure of interest is the network availability.
In this application the network is failed (down) if some
of its nodes are disconnected or if the network is congested
(not able to accept the entire arriving traffic). Network
availability, in the usual manner, is:

    $AV = \frac{E\{\text{network uptime during one cycle}\}}{E\{\text{cycle time}\}}$.   (3.5)




Let I(k) and J(k) be indicator functions:

    $I(k) \equiv \begin{cases} 1, & \text{network being in the state } k \text{ is connected,} \\ 0, & \text{otherwise;} \end{cases}$

    $J(k) \equiv \begin{cases} 1, & \text{network being in the state } k \text{ is not congested,} \\ 0, & \text{otherwise.} \end{cases}$

It again follows from theorem 3.6.1 [3] that the network
availability (3.5) is:



Fig. 1. A Typical Realization of the Process X(t)

Process i corresponds to failure type i and has a
failure rate $\lambda_i$. When the first failure occurs then, with
probability $\lambda_i/\Lambda$, it is of type i and the system enters
repair stage i, during which no new failures can be
generated.

The instants at which X(t) changes from state i to state
0 can be regarded as the regeneration points of the process.
It follows from the theory of regenerative processes (see
Ross [3], or for more details [2]) that the steady-state
probabilities of the process X(t),

    $p_j = \lim_{t \to \infty} \Pr\{X(t) = j\}$,   j = 0, 1, ..., M,

are:

    $p_0 = 1 \Big/ \Bigl(1 + \sum_{i=1}^{M} \rho_i\Bigr)$.       (3.1)
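The steady-state probabilities are cheap to evaluate numerically. The sketch below is illustrative only: the function name `steady_state` is hypothetical, and it assumes that $\rho_j$ is the product of the failure rate and the mean downtime, $\rho_j = \lambda_j E\{D_j\}$, with $p_j = \rho_j p_0$ for j = 1, ..., M. The two-failure-type input is made up for the demonstration.

```python
def steady_state(failure_rates, mean_downtimes):
    """Return [p_0, p_1, ..., p_M] for the regenerative up/down process."""
    # ASSUMPTION: rho_j = lambda_j * E{D_j} (mean downtime per unit of uptime).
    rho = [lam * d for lam, d in zip(failure_rates, mean_downtimes)]
    p0 = 1.0 / (1.0 + sum(rho))              # (3.1)
    return [p0] + [r * p0 for r in rho]      # p_j = rho_j * p_0

# Hypothetical system: rates per year, mean downtimes in years.
p = steady_state([0.5, 2.0], [1.0 / 52.0, 1.0 / 365.0])
print(p, sum(p))
```

Only the mean of each downtime distribution enters the result, which is why the model can allow generally distributed downtimes.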



    $AV = \sum_{k=0}^{M} I(k)\, J(k)\, p_k$.                     (3.6)



4. NETWORK THROUGHPUT MEASUREMENT 

It is assumed that the traffic flow in the network in any 
state can be modeled by a network of Markov queues (for 
details see Kelly [1]). 

Notation 

When the network is in state k (k = 0, 1, ..., M):

$\nu(k)$       network input traffic rate (offered traffic rate)
$\nu^*(k)$     maximal network throughput
$g_i(k)$       probability that a packet (call or message) appears
               initially at node i;
$f_{ij}(k)$    probability that a packet finishing service at node i
               goes next to node j;
$1 - \sum_{j} f_{ij}(k)$   probability that a packet leaves the
               network at node i;
$\mu_i(k)$     service rate at node i;

for all nodes i, j = 1, ..., I.

Schweitzer [4] has shown that the maximal throughput
$\nu^*(k)$ through the network (being at state k) is:

    $\nu^*(k) = [\max(a_1(k), \ldots, a_I(k))]^{-1}$             (4.1)

    $a_i(k) \equiv e_i(k)/\mu_i(k)$,   i = 1, ..., I.             (4.2)




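$e_i(k)$, i = 1, ..., I, is the unique solution of the system of
equations

    $e_i(k) = g_i(k) + \sum_{j=1}^{I} e_j(k)\, f_{ji}(k)$,   i = 1, ..., I.   (4.3)

In matrix notation, when we denote $g(k) = (g_1(k), \ldots, g_I(k))^T$,
$e(k) = (e_1(k), \ldots, e_I(k))^T$ and $F(k) = (f_{ij}(k))_{i,j=1}^{I}$,
the system (4.3) can be written as:

    $e(k) = g(k) + F^T(k)\, e(k)$

and has the solution:

    $e(k) = (\mathbf{I} - F^T(k))^{-1}\, g(k)$,                  (4.4)

where $\mathbf{I}$ denotes the identity matrix.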
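The system (4.3) is small enough to solve exactly. The following sketch is one possible way to do so and is not taken from the paper: `solve_traffic` is a hypothetical helper that applies Gauss-Jordan elimination with exact fractions to $(\mathbf{I} - F^T) e = g$. The routing matrix used for the demonstration (all off-diagonal entries equal to 1/6, four nodes, $g_i = 1/4$) is the fully operational case of the example in Section 5.

```python
from fractions import Fraction

def solve_traffic(g, F):
    """Solve e = g + F^T e exactly via Gauss-Jordan on (I - F^T) e = g."""
    n = len(g)
    # Augmented matrix [I - F^T | g]; entry (i, j) of F^T is F[j][i].
    A = [[Fraction(int(i == j)) - Fraction(F[j][i]) for j in range(n)]
         + [Fraction(g[i])] for i in range(n)]
    for col in range(n):
        # Bring a row with a nonzero pivot into position.
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        # Normalize the pivot row, then eliminate the column elsewhere.
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [row[n] for row in A]

sixth = Fraction(1, 6)
F0 = [[0 if i == j else sixth for j in range(4)] for i in range(4)]
g = [Fraction(1, 4)] * 4
e0 = solve_traffic(g, F0)
print(e0)
```

Exact rational arithmetic is a convenience here, since it reproduces values such as 7/16 and 9/16 without rounding; for larger networks a floating-point linear solver would be the more usual choice.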



One can then define $r(k)$, the throughput of a network in
state k, as

    $r(k) \equiv \min\{\nu(k), \nu^*(k)\}$.                       (4.5)




The index i for which $\nu^*(k) = a_i(k)^{-1}$ identifies the
bottleneck node.
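Equations (4.1), (4.2) and (4.5) translate into a few lines of code. This is a minimal sketch with hypothetical function names (`max_throughput`, `throughput`); the input values are those of the fully operational state in the Section 5 example, where every $e_i$ equals 0.5.

```python
def max_throughput(e, mu):
    """Return (nu_star, bottleneck_index) per (4.1)-(4.2); index is 0-based."""
    a = [ei / mi for ei, mi in zip(e, mu)]          # a_i = e_i / mu_i
    bottleneck = max(range(len(a)), key=a.__getitem__)
    return 1.0 / a[bottleneck], bottleneck

def throughput(offered, e, mu):
    """r(k) = min(offered rate, maximal throughput), as in (4.5)."""
    nu_star, _ = max_throughput(e, mu)
    return min(offered, nu_star)

e = [0.5, 0.5, 0.5, 0.5]
mu = [0.5, 0.6, 0.7, 0.8]        # service rates in 1/msec
nu_star, node = max_throughput(e, mu)
print(nu_star, node + 1, throughput(0.95, e, mu))
```

The node with the largest ratio $e_i/\mu_i$ saturates first, so it alone determines the maximal throughput.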



5. EXAMPLE 

For clarity and simplicity in the calculations, consider a
small symmetric network layout with 4 nodes and 6 links,
as shown in figure 2a. Only links can fail, and there are 6
possible failure states: state i (i = 1, ..., 6) is defined as
the state in which link i has failed.

Let $\lambda_i$ = 1/year and $E\{D_i\}$ = 2/365 years, for all i = 1,
..., 6. Then $p_0$ = 0.968, and $p_1 = p_2 = \ldots = p_6$ = 0.0053.
Select $g^T(k)$ = (0.25, 0.25, 0.25, 0.25), $\mu_1(k)$ = 0.5, $\mu_2(k)$
= 0.6, $\mu_3(k)$ = 0.7 and $\mu_4(k)$ = 0.8 for all k = 0, 1, ..., 6.
The node processing rates are measured in (1/msec). When
link i connecting node j fails, the link i traffic will be
equally shared by the two remaining links connecting
node j with the rest of the network.

When the network is in state 0, then $e_1(0) = e_2(0) =
e_3(0) = e_4(0) = e$ and the system (4.3) reduces to one
equation: $e = (1/4) + (1/6) \cdot (e + e + e)$. Solving this gives
e = 0.5, and

    $\nu^*(0) = \Bigl[\max\Bigl(\frac{1}{2\mu_1(0)}, \frac{1}{2\mu_2(0)}, \frac{1}{2\mu_3(0)}, \frac{1}{2\mu_4(0)}\Bigr)\Bigr]^{-1}
             = [\max(1, 0.83, 0.71, 0.63)]^{-1} = 1$.

Fig. 2. Network Layout and the Routing Matrix: (a) in the state 0
(fully operational state), with routing matrix F(0); (b) in the
state 1 (link 1 is down)

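When the network is in state 1 (see figure 2b), the system
(4.3) becomes:

    $e_1(1) = \frac{1}{4} + \frac{1}{6}\,(e_2(1) + e_3(1))$

    $e_2(1) = \frac{1}{4} + \frac{1}{4}\,e_1(1) + \frac{1}{6}\,e_3(1) + \frac{1}{4}\,e_4(1)$

    $e_3(1) = \frac{1}{4} + \frac{1}{4}\,e_1(1) + \frac{1}{6}\,e_2(1) + \frac{1}{4}\,e_4(1)$

    $e_4(1) = \frac{1}{4} + \frac{1}{6}\,(e_2(1) + e_3(1))$        (4.3a)
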



Because of symmetry, we must have $e_1(1) = e_4(1) = e$ and
$e_2(1) = e_3(1) = f$, and it is easy to see that e = 7/16 and
f = 9/16. Thus $e^T(1)$ = (7/16, 9/16, 9/16, 7/16) and $\nu^*(1)$
= [max(0.8750, 0.9375, 0.8036, 0.5469)]$^{-1}$ = 1.07.
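The values of e and f follow from a two-equation reduction. The sketch below spells it out, assuming (as our reading of the state-1 system) that nodes 1 and 4 each route with probability 1/4 to nodes 2 and 3, while nodes 2 and 3 keep routing with probability 1/6 to every other node:

```latex
\begin{align*}
e &= \tfrac{1}{4} + \tfrac{1}{6}(f + f) = \tfrac{1}{4} + \tfrac{1}{3} f, \\
f &= \tfrac{1}{4} + \tfrac{1}{4} e + \tfrac{1}{6} f + \tfrac{1}{4} e
   \;\Longrightarrow\; \tfrac{5}{6} f = \tfrac{1}{4} + \tfrac{1}{2} e
   \;\Longrightarrow\; f = \tfrac{3}{10} + \tfrac{3}{5} e, \\
e &= \tfrac{1}{4} + \tfrac{1}{10} + \tfrac{1}{5} e
   \;\Longrightarrow\; \tfrac{4}{5} e = \tfrac{7}{20}
   \;\Longrightarrow\; e = \tfrac{7}{16}, \qquad
f = \tfrac{3}{10} + \tfrac{21}{80} = \tfrac{9}{16}.
\end{align*}
```

Dividing by the service rates gives $a(1) = (0.875, 0.9375, 0.8036, 0.5469)$, so node 2 is the bottleneck and $\nu^*(1) = 1/0.9375 \approx 1.07$.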

Analogously we get: 
    k     $e^T(k)$                       $\nu^*(k)$
    2     (7/16, 7/16, 9/16, 9/16)       1.14
    3     (9/16, 7/16, 7/16, 9/16)       0.88
    4     (9/16, 9/16, 7/16, 7/16)       0.88
    5     (7/16, 9/16, 7/16, 9/16)       1.07
    6     (9/16, 7/16, 9/16, 7/16)       0.88

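The values for all seven states can be checked with a short script. This is a sketch under stated assumptions rather than the authors' procedure: the mapping of link numbers to node pairs in `LINKS` is inferred from the $e^T(k)$ vectors (the two end nodes of the failed link are the ones with 7/16), the helper names are hypothetical, and the traffic equations are solved by simple fixed-point iteration instead of matrix inversion.

```python
def routing_matrix(failed_link, links, n=4):
    """Routing probabilities with one link down (None = fully operational)."""
    F = [[0.0] * n for _ in range(n)]
    for a, b in links.values():
        F[a][b] = F[b][a] = 1.0 / 6.0
    if failed_link is not None:
        a, b = links[failed_link]
        F[a][b] = F[b][a] = 0.0
        # The failed link's traffic is shared equally by the two
        # remaining links at each end node (1/6 + 1/12 = 1/4 each).
        for node, other in ((a, b), (b, a)):
            for j in range(n):
                if j not in (node, other):
                    F[node][j] += 1.0 / 12.0
    return F

def visit_rates(g, F, iterations=200):
    """Iterate e <- g + F^T e; converges since each row of F sums to 1/2."""
    n = len(g)
    e = list(g)
    for _ in range(iterations):
        e = [g[i] + sum(e[j] * F[j][i] for j in range(n)) for i in range(n)]
    return e

# ASSUMPTION: link numbering inferred from the e(k) vectors (0-based nodes).
LINKS = {1: (0, 3), 2: (0, 1), 3: (1, 2), 4: (2, 3), 5: (0, 2), 6: (1, 3)}
MU = [0.5, 0.6, 0.7, 0.8]
G = [0.25] * 4

nu_star = {}
for k in range(7):
    F = routing_matrix(None if k == 0 else k, LINKS)
    e = visit_rates(G, F)
    nu_star[k] = 1.0 / max(ei / mi for ei, mi in zip(e, MU))
    print(k, [round(x, 4) for x in e], round(nu_star[k], 4))
```

Under these assumptions the unrounded maximal throughput in states 3, 4 and 6 is 8/9, which appears as 0.88 in the table, presumably truncated to two decimals.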



In the failure states 1, 2, 5, the maximum throughput rate
is higher than the maximum throughput rate in the fully
operational network. This paradox can be explained by the
fact that the selection of alternative routings puts a smaller
load on the slowest nodes, thus creating favorable conditions
for better throughput.

Let the offered traffic rate of the network $\nu(k) = \nu$ =
0.95 for all k = 0, 1, ..., 6. The states with unacceptable
performance are the states 3, 4, 6. Using (3.4) and (4.5),
the average throughput of the network is:

    $AT = 0.95\,(p_0 + p_1 + p_2 + p_5)
        + 0.88\,(p_3 + p_4 + p_6) = 0.9486$.

In this example we have only the congestive type of failures
and therefore the connectivity failure indicator function
I(k) = 1 for all k = 0, 1, ..., 6. The indicator function for
congestive failure is:

    $J(k) = \begin{cases} 1, & \text{for } k = 0, 1, 2, \text{ and } 5; \\ 0, & \text{for } k = 3, 4, \text{ and } 6. \end{cases}$

Thus the availability of our network is:

    $AV = p_0 + p_1 + p_2 + p_5 = 0.9839$.
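Both figures of merit can be recomputed in a few lines. The sketch below is illustrative: it takes the two-decimal maximal throughputs reported in the text, assumes $\rho = \lambda E\{D\}$ for the steady-state probabilities, and treats a state as congested when its maximal throughput is below the offered rate (our reading of "not able to accept the entire arriving traffic").

```python
# Steady-state probabilities; ASSUMPTION: rho = lambda * E{D}.
rho = 1.0 * (2.0 / 365.0)
p0 = 1.0 / (1.0 + 6 * rho)
p = [p0] + [rho * p0] * 6

offered = 0.95
# Maximal throughput per state k = 0..6, as reported in the text.
nu_star = [1.00, 1.07, 1.14, 0.88, 0.88, 1.07, 0.88]

r = [min(offered, v) for v in nu_star]                  # (4.5)
connected = [1] * 7                                     # I(k): no connective failures
not_congested = [int(v >= offered) for v in nu_star]    # J(k), assumed criterion

AT = sum(rk * pk for rk, pk in zip(r, p))               # (3.4)
AV = sum(i * j * pk
         for i, j, pk in zip(connected, not_congested, p))  # (3.6)
print(f"AT = {AT:.4f}, AV = {AV:.4f}")
```

Evaluated with unrounded $p_k$, this gives roughly 0.9489 and 0.9841; the small differences from the figures in the text come from working with probabilities rounded to 0.968 and 0.0053.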

SUMMARY 

In this application, throughput rate is a measure of performance
of a communication/computer network. The
throughput and availability are assessed for a highly
reliable network.

The assessment produces a single figure of merit and
can be a valuable tool for network designers and network
operations managers because it helps to evaluate potential
measures for improvements, suggests routing alternatives,
and estimates important parameters for both new and existing
networks. However, the design of a communication/
computer network is a very complex task which involves a
wide range of factors. The designer faces many objectives,
some of which can be contradictory. The present
model is just one building block of an entire
decision-support system.